Web Page Title Extraction and Its Application

Authors

Yewei Xue

Yunhua Hu

Guomao Xin

Ruihua Song

Shuming Shi

Yunbo Cao

Chin-Yew Lin

Hang Li

Abstract

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; in reality, however, they are often bogus. It is therefore advantageous to be able to extract titles from HTML documents automatically. In this paper, we take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a 'definition' of HTML titles. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (Document Object Model) tree; in the other, we utilize features based on vision. We also combine the two methods to further enhance extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in the largest font size as the title (22.6%-37.4% improvements in terms of F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (25.1%-30.3% improvements).
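To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of a "largest font size" title heuristic. It is not the authors' implementation: the class name `LargestFontBaseline`, the function `extract_title_baseline`, and the `TAG_SIZES` table are hypothetical, and the table merely assumes nominal heading sizes similar to common browser defaults. The sketch ignores CSS and inline styles, which a real vision-based approach would have to take into account.

```python
from html.parser import HTMLParser

# Assumption: nominal font sizes (px) for heading tags, loosely modeled on
# common browser defaults. Real pages override these with CSS.
TAG_SIZES = {"h1": 32.0, "h2": 24.0, "h3": 18.72,
             "h4": 16.0, "h5": 13.28, "h6": 10.72}
DEFAULT_SIZE = 16.0
# Text inside these tags is never rendered in the page body.
SKIPPED = {"script", "style", "title"}


class LargestFontBaseline(HTMLParser):
    """Collects (nominal size, text) pairs for text in the document."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open heading tags as (tag, size)
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []      # (size, text) for each text run

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in TAG_SIZES:
            self.stack.append((tag, TAG_SIZES[tag]))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in TAG_SIZES:
            # Close the innermost matching heading, if any.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.skip_depth:
            return
        size = self.stack[-1][1] if self.stack else DEFAULT_SIZE
        self.chunks.append((size, text))


def extract_title_baseline(html):
    """Return all text set in the largest nominal font size."""
    parser = LargestFontBaseline()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        return ""
    largest = max(size for size, _ in parser.chunks)
    return " ".join(text for size, text in parser.chunks if size == largest)


page = ("<html><head><title>Untitled</title></head><body>"
        "<h2>Site</h2><h1>Web Page Title Extraction</h1>"
        "<p>Body text.</p></body></html>")
print(extract_title_baseline(page))
```

Note that the sketch deliberately ignores the `<title>` field, since the abstract's premise is that this field is often unreliable. The learned methods described in the abstract replace this single hand-written rule with features drawn from the DOM tree and from visual rendering.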

Related resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Information Extraction from HTML: Application of a General Machine Learning Approach

Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem and argue for the suitability of relational learning in solving it. The implementation of a general purpose relational learner for information extrac...

Title-Block Based Web Page Reorganization

For cell phone users and blind people using non-visual browsers, browsing the Web with common browsers is quite inefficient due to the problem of information overload. This paper presents the TB-WPRO (Title-Block based Web Page Re-Organization) method, which hierarchically segments web pages into blocks using visual and layout information reflecting the web designers' intent. TB-WPRO segments the web...

Extracting News Web Page Creation Time with DCTFinder

Web pages do not offer reliable metadata concerning their creation date and time. However, obtaining the document creation time is a necessary step before temporal normalization systems can be applied to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts from its content the title and the creation date of this web page. DCTFinder combines heuristic title ...

Performance Analysis of Vision-based Deep Web Data Extraction for Web Document Clustering

Web data extraction is a critical task that applies various scientific tools across a broad range of application domains. Extracting data from multiple web sites is becoming more obscure, and the design of web information extraction systems is becoming more complex and time-consuming. We also present in this paper various risks in web data extraction. Identifying data region from web is a...

Publication date: 2006
